Demonstration Video


Introduction

Our design is a Raspberry Pi-based intelligent assistant that can freestyle rap about a topic given by the user's voice command. It automatically detects its trigger word (its name, "Andrew"), transcribes the voice collected from the microphone into text, understands the topic from the natural-language sentence, generates related lyrics and a background beat, and finally plays them through the speaker. We also implemented a screen display that shows the dialogue content with signal waves as the background, and we extended the dialogue-based interaction so that the assistant can speak the current weather for a given location. The result is an embedded device with a microphone and speaker as input and output, interacting with users through voice and language processing algorithms.



Project Objective:

  • Wait quietly and wake up when its name is called
  • Understand your voice command and make reactions accordingly
  • Generate hip-hop lyrics on the fly and rap them over the machine-generated beat

Design & Testing

  1. Wake word detection
    We built a wake word listener that continually monitors the sounds around the device and activates when the speech matches its wake word. The audio is processed in chunks: we compute MFCC features of the speech in real time, feed the features into a neural network consisting of 20 gated recurrent units (GRUs), and make a prediction for every chunk to decide whether the wake word was spoken, as in the sketch below.
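    Below is a minimal sketch of the chunk-level classifier, assuming python_speech_features for the MFCC computation and Keras for the network; the 20-unit GRU layer follows the description above, while everything else (feature dimensions, optimizer) is an illustrative assumption.

      import numpy as np
      from python_speech_features import mfcc          # assumed MFCC library
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import GRU, Dense

      SAMPLE_RATE = 16000

      def build_listener(n_features=13):
          # Chunk-level classifier: MFCC frames -> 20 GRUs -> wake word probability
          model = Sequential([
              GRU(20, input_shape=(None, n_features)),
              Dense(1, activation='sigmoid'),
          ])
          model.compile(optimizer='adam', loss='binary_crossentropy')
          return model

      def chunk_probability(model, audio_chunk):
          # Compute MFCC features for the latest chunk and score it
          feats = mfcc(audio_chunk, samplerate=SAMPLE_RATE)     # (frames, 13)
          return float(model.predict(feats[None, ...], verbose=0)[0, 0])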

  2. Speech to Text & Text to Speech
    Once the wake word is detected, the speech-to-text function is triggered to record voice and convert it to text. Given the limited compute of the Raspberry Pi, we chose a mature online service for the recognition: by continuously streaming the audio chunks to the Google Cloud Speech API, we obtain the recognized text in real time, roughly as sketched below.
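    As a rough illustration, the streaming call looks like the following (a sketch based on the google-cloud-speech client samples; the chunk generator and language settings are assumptions matching our 16 kHz microphone stream):

      from google.cloud import speech

      def recognize_stream(audio_chunks, sample_rate=16000):
          # Stream raw LINEAR16 chunks to the API and return the first final result
          client = speech.SpeechClient()
          config = speech.RecognitionConfig(
              encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
              sample_rate_hertz=sample_rate,
              language_code="en-US",
          )
          streaming_config = speech.StreamingRecognitionConfig(config=config)
          requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                      for chunk in audio_chunks)
          for response in client.streaming_recognize(streaming_config, requests):
              for result in response.results:
                  if result.is_final:
                      return result.alternatives[0].transcript
          return ""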

  3. Topic understanding
    With the text recognized, the Raspberry Pi needs to understand its content so that it can react correctly to the user's command. We make use of the Microsoft Azure Language Understanding service (LUIS) to extract the user's intent and the corresponding entities from the recognized sentence. In our system we defined two major intents, "Freestyle" and "Weather"; other topics are currently ignored. LUIS also returns the entities in the sentence: for example, if we ask about the weather in a specific city, the city name comes back as an entity, which lets us process each intent flexibly. A sketch of the query follows.
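    A hedged sketch of the LUIS query, using the azure-cognitiveservices-language-luis runtime SDK; the app ID, key, and endpoint are placeholders, and the fields read here are the same ones intent_recognize() in the Code Appendix consumes:

      from azure.cognitiveservices.language.luis.runtime import LUISRuntimeClient
      from msrest.authentication import CognitiveServicesCredentials

      def understand(text, app_id, key,
                     endpoint="https://westus.api.cognitive.microsoft.com"):
          # Ask LUIS for the top-scoring intent and any entities in the sentence
          client = LUISRuntimeClient(endpoint, CognitiveServicesCredentials(key))
          result = client.prediction.resolve(app_id, text)
          intent = result.top_scoring_intent.intent       # e.g. "Freestyle" / "Weather"
          entities = [e.entity for e in result.entities]  # e.g. a city name
          return intent, entities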

  4. Rap lyrics generation
    In general, we used a character-level recurrent neural network with LSTM units. We chose a character-level representation because:
    • it does not require tokenization as a preprocessing step
    • it does not require unknown-word handling
    • it generates from a comparatively small vocabulary, using less memory
    • it can mimic grammatically correct sequences for a wide range of languages
    • it also includes punctuation, which makes the pauses in the lyrics more natural
    We chose the LSTM (long short-term memory) unit because it takes more context into consideration while avoiding vanishing gradients.

    We picked Eminem as the artist for our model to imitate because, according to a study conducted by the lyrics site Musixmatch, Eminem has the largest vocabulary in the music industry. We found a lyrics dataset scraped from LyricsFreak that includes 70 Eminem songs.

    We first combine those songs into a large 200k-character string with 50 unique characters, then cut the text into semi-redundant character sequences and vectorize them into input sequences; the target output for each sequence is the next character in the corpus.
    The model is built on the Keras text-generation example and consists of a linear stack of a long short-term memory layer and a regular fully-connected layer, as sketched below. Because of our limited computational resources, each epoch takes about 1 minute; the model was trained for 1200 epochs, and this part cost us 30 hours in total.
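    The sketch below shows the data preparation and model along the lines of the Keras example; the sequence length, stride, 128 LSTM units, and optimizer are the example's defaults and may differ from our exact settings.

      import numpy as np
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import LSTM, Dense

      MAXLEN, STEP = 40, 3  # sequence length and stride (Keras example defaults)

      def make_dataset(text, chars):
          # Cut the corpus into semi-redundant sequences; the target for each
          # sequence is the next character in the corpus.
          char_idx = {c: i for i, c in enumerate(chars)}
          starts = range(0, len(text) - MAXLEN, STEP)
          x = np.zeros((len(starts), MAXLEN, len(chars)), dtype=bool)
          y = np.zeros((len(starts), len(chars)), dtype=bool)
          for i, s in enumerate(starts):
              for t, ch in enumerate(text[s:s + MAXLEN]):
                  x[i, t, char_idx[ch]] = 1
              y[i, char_idx[text[s + MAXLEN]]] = 1
          return x, y

      def build_model(n_chars):
          # Linear stack: one LSTM layer plus a fully-connected softmax layer
          model = Sequential([
              LSTM(128, input_shape=(MAXLEN, n_chars)),
              Dense(n_chars, activation='softmax'),
          ])
          model.compile(optimizer='adam', loss='categorical_crossentropy')
          return model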

  5. Beat generation
    The Deep Learning AI specialization team provided a dataset in which the musical data is preprocessed so that we can render it in terms of musical "values." Each value can be considered a note, comprising a pitch and a duration.
    Similar to the text-generation model, the beat is also learned by an LSTM network. The architecture of the model is illustrated in the figure below. The difference between the lyrics model and the beat model is that the first input is randomly generated rather than given by the user; a sketch of the generation loop follows.
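    A minimal sketch of the sampling loop, assuming a trained Keras model with a softmax over one-hot note values (model and n_values are placeholders):

      import numpy as np

      def generate_beat(model, n_values, length=50):
          # Seed with one random value, then feed the growing one-hot sequence
          # back in; each sampled value encodes a pitch and a duration.
          seq = np.zeros((1, 1, n_values))
          seq[0, 0, np.random.randint(n_values)] = 1   # random first input
          notes = []
          for _ in range(length):
              probs = model.predict(seq, verbose=0)[0].astype("float64")
              probs /= probs.sum()                     # renormalize float error
              value = int(np.random.choice(n_values, p=probs))
              notes.append(value)
              step = np.zeros((1, 1, n_values))
              step[0, 0, value] = 1
              seq = np.concatenate([seq, step], axis=1)  # append the sample
          return notes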


Issues


Drawings

Final demo


Result

We basically followed our expected time schedule and accomplished the basic functions we proposed in the original proposal. In addition, this freestyle Pi supports some dialogue-based interactions; for example, it can tell you the weather of a given location and who the most handsome man in the world is, so we consider this project a success. Future work is needed to make this voice assistant more intelligent.


Future Work


Work Distribution


Project group picture


Xin

xf78@cornell.edu

Designed the overall software architecture (Just being himself).


Yangmengyuan Zhao

yz2453@cornell.edu

Lyrics and Beat Generation

Figure Design and Video Editing


Project Parts

Part                 From     Cost
Raspberry Pi         Lab      $0.00
Speaker              Lab      $0.00
PS3 Eye Microphone   Amazon   $8.53

Total: $8.53


Acknowledgements

We really appreciate everyone who has helped us build this project.

References

The Largest Vocabulary In Music
Keras LSTM example
Deep Learning AI Specialization
Pyalsaaudio Library
Mycroft-precise


Code Appendix


    import sys
    import time
    import random
    import queue
    import threading
    
    from termcolor import cprint
    from utils.audio import ResumableMicrophoneStream
    from utils.detect_queue import DetectQueue
    from utils.credentials import init_credentials
    
    from trigger_detector import TriggerDetector
    
    from speech_to_text import SpeechToText
    from lang_understand import LangUnderstand
    from text_to_speech import TextToSpeech
    from lyrics_generator import LyricsGenerator
    
    from tft_display import TFTDisplay
    
    # Audio recording parameters
    SAMPLE_RATE = 16000
    CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100ms
    STREAM_LIMIT = 5000
    
    
    class Andrew(object):
        """the rap voice assisstant
        """
        def __init__(self, detect_model="data/andrew2.net",
                            lyrics_model="data/keras_model_1200.h5",
                            lyrics_chars="data/chars.pkl"):
            # microphone
            self.mic = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    
            # wake word detector
            self.detector = TriggerDetector(detect_model)
    
            # speech and language services
            self.speech_client = SpeechToText()
            self.luis = LangUnderstand()
            self.tts = TextToSpeech()
    
            # lyrics generator model
            self.lyrics_gen = LyricsGenerator(lyrics_model, lyrics_chars)
    
            self.pred_queue = DetectQueue(maxlen=5)
            self.is_wakeup = False
    
            # pytft display
            self.tft = TFTDisplay()
            self.tft_queue = queue.Queue()
            self.tft_thread = threading.Thread(target=self.tft_manage, args=())
            self.tft_thread.daemon = True
            self.tft_thread.start()
    
            self.notify("hi_there")
    
    
        def notify(self, topic="hi_there", is_async=False, audio_path="data/audio"):
            # Notify with local preset audio files
            from os.path import join, isfile
            audio_file = join(audio_path, f"{topic}.wav")
            if not isfile(audio_file):
                return
    
            self.tts.play_file(audio_file, is_async)
    
    
        def generate_rap(self, topic="", beat_path="data/beat"):
            """Generate rap and play
            """
            tts = self.tts
            lyrics_gen = self.lyrics_gen
    
            response = tts.generate_speech(f"hey, I can rap about {topic}")
            tts.play(response, True)
    
            # Generate based on topic
            lyrics_output = lyrics_gen.generate(topic)
    
            # Generate speech
            lyrics_speech = tts.generate_speech(lyrics_output)
    
            # Select beat
            beat_index = random.randint(0, 20)
    
            # Play beat and lyrics
            tts.play_file(f'{beat_path}/beat_{beat_index}.wav', True)
            tts.play(lyrics_speech)
    
        def get_weather_message(self, city="Ithaca"):
            import requests, os
            api_key = os.getenv('WEATHER_APIKEY')
            base_url = "https://api.openweathermap.org/data/2.5/weather?"
            city_name = f"{city},us"
            complete_url = f"{base_url}q={city_name}&units=imperial&APPID={api_key}"
            try:
                response = requests.get(complete_url)
                res = response.json()
                msg_weather = f"Today, it's {res['weather'][0]['description']} in {city}. "
                msg_temp = f"The temperature is {int(res['main']['temp'])} degrees."
                return msg_weather + msg_temp
            except (requests.RequestException, KeyError, ValueError):
                pass
    
            return ""
    
    
        def intent_recognize(self, text=""):
            """Recognize intent
            """
            luis = self.luis
            tts = self.tts
    
            # Get result from language understanding engine
            luis_result = luis.predict(text)
            intent = luis_result.top_scoring_intent.intent
    
            if intent == "Freestyle":
                entities = luis_result.entities
                entity_topic = "rap"
                if (len(entities) > 0):
                    entity = entities[0]
                    cprint(f'The topic is {entity.entity}', 'cyan')
                    entity_topic = entity.entity
                self.generate_rap(entity_topic)
    
            elif intent == "Weather":
                response = tts.generate_speech("I will tell you the weather in Ithaca.")
                tts.play(response)
    
                weather = self.get_weather_message()
                response = tts.generate_speech(weather)
                tts.play(response)
    
            else:
                self.notify("sorry")
    
    
        def tft_manage(self):
            """Manage TFT display through state
            """
            self.tft.display_text("Andrew is waking up")
            status = {'state': 'None'}
    
            while True:
                if status['state'] == 'wait':
                    self.tft.display_wave()
    
                elif status['state'] == 'listen':
                    self.tft.display_wave((0, 255, 0))
    
                # Update the status
                try:
                    update = self.tft_queue.get(block=False)
                    if update is not None:
                        status = update
    
                except queue.Empty:
                    continue
    
    
        def start(self):
            """Start listening and interacting
            """
            tft = self.tft
            tts = self.tts
    
            # Init stream
            with self.mic as stream:
    
                self.tft_queue.put({'state': 'listen'})
    
                while True:
                    if not self.is_wakeup:
                        stream.closed = False
    
                        while not stream.closed:
    
                            stream.audio_input = []
                            audio_gen = stream.generator()
    
                            for chunk in audio_gen:
                                if not self.is_wakeup:
    
                                    prob = self.detector.get_prediction(chunk)
    
                                    self.pred_queue.append(prob > 0.6)
                                    print('!' if prob > 0.6 else '.', end='', flush=True)
    
                                    if (self.pred_queue.count >= 2):
                                        self.notify("hi")
                                        cprint(' Trigger word detected! \n', 'magenta')
                                        self.pred_queue.clear()
                                        self.is_wakeup = True
                                        stream.pause()
                                        break
                    else:
                        cprint('Speech to text\n', 'green')
    
                        time.sleep(1)
                        stream.closed = False
    
                        try:
                            voice_command = self.speech_client.recognize(stream)
    
                            cprint(f'{voice_command}\n', 'yellow')
                            cprint('Recognition ended...\n', 'red')
    
                            stream.pause()
    
                            #tft.display_text(f'"{voice_command}"')
    
                            if ("goodbye" in voice_command):
                                self.notify("see_you")
                                sys.exit()
    
                            if ("sorry" in voice_command):
                                self.notify("its_ok")
    
                            else:
                                cprint('Recognize intents...', 'cyan')
                                self.intent_recognize(voice_command)
    
                        except Exception as e:
                            cprint(f'Error: {e}', 'red')
    
                        self.is_wakeup = False
    
    
    def main():
    
        # set credentials for cloud services
        init_credentials()
    
        # init and start andrew
        andrew = Andrew()
        andrew.start()
    
    
    if __name__ == "__main__":
        main()